Job Management Requirements for NAS Parallel Systems and Clusters

Authors

  • William Saphir
  • Leigh Ann Tanner
  • Bernard Traversat
Abstract

A job management system is a critical component of a production supercomputing environment, permitting oversubscribed resources to be shared fairly and efficiently. Job management systems that were originally designed for traditional vector supercomputers are not appropriate for the distributed-memory parallel supercomputers that are becoming increasingly important in the high performance computing industry. Newer job management systems offer new functionality but do not solve fundamental problems. We address some of the main issues in resource allocation and job scheduling we have encountered on two parallel computers — a 160-node IBM SP2 and a cluster of 20 high performance workstations located at the Numerical Aerodynamic Simulation facility. We describe the requirements for resource allocation and job management that are necessary to provide a production supercomputing environment on these machines, prioritizing according to difficulty and importance, and advocating a return to fundamental issues.

Similar resources

Novel HPC Technologies for Scalable CAE: The Case for Parallel I/O and File Systems

As HPC continues its aggressive platform migration from proprietary supercomputers and Unix servers to HPC clusters, expectations grow for clusters to meet the I/O demands of increasing fidelity in CAE modeling and data management in the CAE workflow. Cluster deployments have increased as organizations seek ways to cost-effectively grow compute resources for CAE applications, and during this mig...

ACL2 for Parallel Systems Software: A Progress Report

A significant development in high-performance computing has occurred in recent years with the proliferation of “Beowulf” clusters [6]. Beowulf clusters are parallel computers assembled from commodity-priced personal computers and networks. The explosive growth of the personal computer marketplace, together with rapid technological advances in the hardware sold there, has driven the price/perfor...

A Comparison of Workload Traces from Two Production Parallel Machines

The analysis of workload traces from real production parallel machines can aid a wide variety of parallel processing research, providing a realistic basis for experimentation in the management of resources over an entire workload. We analyze a five-month workload trace of an Intel Paragon machine supporting a production parallel workload at the San Diego Supercomputer Center (SDSC), comparing and...

Object Storage: Scalable Bandwidth for HPC Clusters

This paper describes the Object Storage Architecture solution for cost-effective, high-bandwidth storage in High Performance Computing (HPC) environments. An HPC environment requires a storage system to scale to very large sizes and performance without sacrificing cost-effectiveness or ease of sharing and managing data. Traditional storage solutions, including disk-per-node, Storage-Area Netwo...

JAMILA: A Usable Batch Job Management System to Coordinate Heterogeneous Clusters and Diverse Applications over Grid or Cloud Infrastructure

Usability is an important feature of Grids or Clouds to end users, who may not be computer professionals but need to use massive machines to compute their jobs. For meeting various computing or management requirements, heterogeneous clusters with diverse Distributed Resource Management Systems (D-RMS) and applications are needed to supply computing services in Grids or Clouds. The heterogeneity...

Publication date: 1995